feat: enable GRPO training with logprobs from offline trajectory data #467
Open · JRMeyer wants to merge 9 commits into OpenPipe:main from JRMeyer:fix/warn-engine-args-in-openai-server-config
Conversation
Add a runtime warning when users pass engine-initialization-only arguments (`max_logprobs`, `gpu_memory_utilization`, `tensor_parallel_size`, `max_model_len`) via `OpenAIServerConfig.engine_args`. These arguments are silently ignored because the vLLM engine is initialized by Unsloth before `OpenAIServerConfig` is applied. The warning directs users to `TrainableModel._internal_config` instead.
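A minimal sketch of the check (helper name and call site are hypothetical; the warning text paraphrases the commit message):

```python
import warnings

# Engine-initialization-only arguments: the vLLM engine is already running
# (initialized by Unsloth) before OpenAIServerConfig is applied, so these
# would be silently ignored if passed via engine_args.
ENGINE_INIT_ONLY_ARGS = {
    "max_logprobs",
    "gpu_memory_utilization",
    "tensor_parallel_size",
    "max_model_len",
}

def warn_on_engine_init_only_args(engine_args: dict) -> None:
    ignored = ENGINE_INIT_ONLY_ARGS.intersection(engine_args)
    if ignored:
        warnings.warn(
            f"engine_args {sorted(ignored)} are ignored because the vLLM engine "
            "is already initialized; set them via TrainableModel._internal_config "
            "instead.",
            UserWarning,
        )
```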
The `_internal_config` field was being lost when `TrainableModel` was deserialized from JSON (e.g., when sent from the client to the SkyPilot backend), because Pydantic ignores fields starting with an underscore during `model_validate()`. Added a `model_validator(mode="wrap")` that extracts `_internal_config` from the input data before validation and sets it on the model afterwards. This fixes the "Cannot request more than 0 logprobs" error when using `_internal_config.engine_args` with remote backends.
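A minimal sketch of the approach (not ART's exact code; field types simplified). Pydantic v2 drops underscore-prefixed keys during `model_validate()`, so the wrap validator pulls `_internal_config` out of the raw input and re-attaches it after normal validation:

```python
from typing import Any, Optional
from pydantic import BaseModel, PrivateAttr, model_validator

class TrainableModel(BaseModel):
    name: str
    _internal_config: Optional[dict] = PrivateAttr(default=None)

    @model_validator(mode="wrap")
    @classmethod
    def _preserve_internal_config(cls, data: Any, handler):
        internal = None
        if isinstance(data, dict) and "_internal_config" in data:
            data = dict(data)  # don't mutate the caller's payload
            internal = data.pop("_internal_config")
        model = handler(data)  # run normal Pydantic validation
        if internal is not None:
            model._internal_config = internal
        return model
```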
Adds three new metrics logged during training to help users verify that importance sampling is working correctly:
- `frac_old_logprobs_valid`: fraction of old logprobs that are not NaN
- `mean_importance_ratio`: mean π_new/π_old across assistant tokens
- `clip_fraction`: fraction of tokens where PPO clipping was triggered

These metrics help diagnose whether GRPO/PPO importance sampling is active or whether training has fallen back to vanilla REINFORCE (when all old logprobs are NaN).
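A minimal sketch (tensor shapes and the clip range are assumptions, not ART's exact code) of how the three metrics can be computed from per-token logprobs:

```python
import torch

def importance_sampling_metrics(
    new_logprobs: torch.Tensor,    # pi_new log-probs, shape [T]
    old_logprobs: torch.Tensor,    # pi_old log-probs, NaN where unavailable
    assistant_mask: torch.Tensor,  # bool mask selecting assistant tokens
    clip_eps: float = 0.2,         # PPO clip range (value assumed)
) -> dict[str, float]:
    valid = assistant_mask & ~torch.isnan(old_logprobs)
    frac_valid = valid.sum() / assistant_mask.sum().clamp(min=1)

    ratio = torch.exp(new_logprobs[valid] - old_logprobs[valid])  # pi_new / pi_old
    clipped = (ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)
    return {
        "frac_old_logprobs_valid": frac_valid.item(),
        "mean_importance_ratio": ratio.mean().item() if valid.any() else float("nan"),
        "clip_fraction": clipped.float().mean().item() if valid.any() else 0.0,
    }
```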
Supports three formats in priority order:
1. New format: `token_ids` + `logprobs.values` (direct arrays)
2. Old format: `logprobs.content` with `token_id:XXX` parsing
3. No logprobs: re-tokenize with NaN logprobs

Fixes a token-count mismatch that caused `frac_old_logprobs_valid=0`.
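A minimal sketch of the priority order (helper name is hypothetical; field names are taken from the commit message):

```python
import math

def tokens_and_logprobs(message: dict, tokenizer) -> tuple[list[int], list[float]]:
    logprobs = message.get("logprobs") or {}

    # 1. New format: direct arrays of token ids and logprob values.
    if "token_ids" in message and "values" in logprobs:
        return list(message["token_ids"]), list(logprobs["values"])

    # 2. Old format: logprobs.content entries whose token field is "token_id:XXX".
    if logprobs.get("content"):
        ids = [int(item["token"].removeprefix("token_id:"))
               for item in logprobs["content"]]
        return ids, [item["logprob"] for item in logprobs["content"]]

    # 3. No logprobs: re-tokenize the text and mark every logprob as NaN.
    ids = tokenizer.encode(message.get("content") or "", add_special_tokens=False)
    return ids, [math.nan] * len(ids)
```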
Skip trajectories where the final assistant message is stripped by the chat template (e.g., when it contains only `<think>` content), causing `continue_final_message=True` to fail.
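A minimal sketch of the skip condition (helper name is hypothetical; this assumes transformers raises `ValueError` when `continue_final_message=True` cannot locate the final message in the rendered template):

```python
def should_skip_trajectory(tokenizer, messages: list[dict]) -> bool:
    try:
        tokenizer.apply_chat_template(
            messages, tokenize=False, continue_final_message=True
        )
    except ValueError:
        # The template stripped the final assistant message
        # (e.g. <think>-only content), so continuation would fail.
        return True
    return False
```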
Summary
This PR enables proper GRPO training with importance sampling when using offline trajectory data (e.g., from vLLM traces). It includes four complementary changes:
1. Extract logprobs from dict messages

Problem: ART's tokenizer only extracted logprobs from OpenAI `Choice` objects, but offline trajectory data often stores logprobs in plain Python dicts. This caused all dict-message logprobs to be set to NaN, making the importance ratio always 1.0 (effectively REINFORCE instead of GRPO).

Solution: Modified `tokenize.py` to also extract logprobs from dict messages that have the format `{"logprobs": {"content": [{"logprob": -0.5}, ...]}}`, as in the example below.
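An illustrative dict message in the shape described above (the format comes from this PR; the content values are made up):

```python
message = {
    "role": "assistant",
    "content": "The answer is 42.",
    "logprobs": {"content": [{"logprob": -0.5}, {"logprob": -0.1}]},
}
```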
2. Strip logprobs before RULER scoring

Problem: When trajectories contain verbose logprobs data, sending them to the RULER judge causes context-length errors.

Solution: Strip logprobs from trajectories before sending them to RULER using `strip_logprobs()`.
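A usage sketch (import paths are assumptions; `ruler_score_group` is ART's RULER scoring entry point, and the judge model name is illustrative):

```python
from art.rewards import ruler_score_group
from art.utils import strip_logprobs  # assumed location of the helper

async def score(group):
    # Drop verbose logprob payloads so the judge prompt stays within context limits.
    return await ruler_score_group(strip_logprobs(group), "openai/o3")
```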
3. Preserve `_internal_config.engine_args`

Problem: When using `TrainableModel._internal_config.engine_args` to configure vLLM engine settings (like `max_logprobs`), the configuration was silently lost when using the SkyPilot backend.

Solution: Add a `model_validator(mode="wrap")` to preserve `_internal_config` during Pydantic deserialization.
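A usage sketch of what the fix enables (config class names are assumed from ART's `art.dev` module and may differ; values are illustrative):

```python
import art

# With the fix, this configuration survives client-to-backend JSON serialization.
model = art.TrainableModel(
    name="offline-grpo",
    project="my-project",
    base_model="Qwen/Qwen2.5-7B-Instruct",
    _internal_config=art.dev.InternalModelConfig(
        engine_args=art.dev.EngineArgs(max_logprobs=64),
    ),
)
```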
4. Add importance sampling observability metrics

Problem: ART computes importance sampling ratios internally but doesn't expose them, making it impossible to verify whether importance sampling is actually working.

Solution: Add three new metrics logged during training:
- `frac_old_logprobs_valid`: fraction of old logprobs that are not NaN (0 = no importance sampling)
- `mean_importance_ratio`: mean π_new/π_old across assistant tokens (should vary around 1.0)
- `clip_fraction`: fraction of tokens where PPO clipping was triggered (>0 means off-policy correction is active)
Impact

Previously, offline trajectories produced all-NaN old logprobs, so the importance ratio π_new / π_old was effectively pinned at 1.0 and training degenerated to vanilla REINFORCE. With these changes, GRPO trains on the true ratio, and `frac_old_logprobs_valid`, `mean_importance_ratio`, and `clip_fraction` make that verifiable.
New metrics interpretation

- `frac_old_logprobs_valid`: near 1.0 means stored logprobs are being used; 0 means training has fallen back to vanilla REINFORCE
- `mean_importance_ratio`: should vary around 1.0 once importance sampling is active
- `clip_fraction`: values above 0 indicate PPO clipping is correcting off-policy drift

Test plan
- Verified the `max_logprobs` setting works with the SkyPilot backend
- `./scripts/run_checks.sh`: all checks pass